Another Look at the Data Sparsity Problem
نویسندگان
چکیده
Performance on a statistical language processing task relies upon accurate information being found in a corpus. However, it is known (and this paper will confirm) that many perfectly valid word sequences do not appear in training corpora. The percentage of n-grams in a test document which are seen in a training corpus is defined as n-gram coverage, and work in the speech processing community [7] has shown that there is a correlation between n-gram coverage and word error rate (WER) on a speech recognition task. Other work (e.g. [1]) has shown that increasing training data consistently improves performance of a language processing task. This paper extends that work by examining n-gram coverage for far larger corpora, considering a range of document types which vary in their similarity to the training corpora, and experimenting with a broader range of pruning techniques. The paper shows that large portions of language will not be represented within even very large corpora. It confirms that more data is always better, but how much better is dependent upon a range of factors: the source of that additional data, the source of the test documents, and how the language model is pruned to account for sampling errors and make computation reasonable.
منابع مشابه
A NOVEL FUZZY-BASED SIMILARITY MEASURE FOR COLLABORATIVE FILTERING TO ALLEVIATE THE SPARSITY PROBLEM
Memory-based collaborative filtering is the most popular approach to build recommender systems. Despite its success in many applications, it still suffers from several major limitations, including data sparsity. Sparse data affect the quality of the user similarity measurement and consequently the quality of the recommender system. In this paper, we propose a novel user similarity measure based...
متن کاملFeature Learning for Multitask Learning Using l2,1 Norm Minimization
We look at solving the task of Multitask Feature Learning by way of feature selection. We find that formulating it as an l2,1 norm minimization problem helps with both sparsity and uniformity, which are required for Multitask Feature Learning. Two approaches are explored, one which formulates an equivalent convex optimization problem and iteratively solves it, and another which formulates two e...
متن کاملA Sharp Sufficient Condition for Sparsity Pattern Recovery
Sufficient number of linear and noisy measurements for exact and approximate sparsity pattern/support set recovery in the high dimensional setting is derived. Although this problem as been addressed in the recent literature, there is still considerable gaps between those results and the exact limits of the perfect support set recovery. To reduce this gap, in this paper, the sufficient con...
متن کاملAnother Look at the Hypocrisy of Chaucer’s Pardoner
For us, readers of Chaucer living in an age when appeal to religious passions and sentiments as a means for the realization of worldly objectives by some charlatans has grown significantly, reviewing the theme of religious hypocrisy treated in The Canterbury Tales can be useful in a way that it proves a helpful means for recognizing and dealing with the hypocrites. The Pardoner of the Tales is ...
متن کاملSOLVING FUZZY LINEAR PROGRAMMING PROBLEMS WITH LINEAR MEMBERSHIP FUNCTIONS-REVISITED
Recently, Gasimov and Yenilmez proposed an approach for solving two kinds of fuzzy linear programming (FLP) problems. Through the approach, each FLP problem is first defuzzified into an equivalent crisp problem which is non-linear and even non-convex. Then, the crisp problem is solved by the use of the modified subgradient method. In this paper we will have another look at the earlier defuzzifi...
متن کاملPrediction of blood cancer using leukemia gene expression data and sparsity-based gene selection methods
Background: DNA microarray is a useful technology that simultaneously assesses the expression of thousands of genes. It can be utilized for the detection of cancer types and cancer biomarkers. This study aimed to predict blood cancer using leukemia gene expression data and a robust ℓ2,p-norm sparsity-based gene selection method. Materials and Methods: In this descriptive study, the microarray ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2006